Natural language processing
30%
Issued: 2019-3-18
Due: 2019-4-18 12:00

Very basic open information extraction implementation


Sketch:

1. (might have to do myself to keep data load down)
Tokenise (Skip stemming? Word vector handles that!)
Give them word vectors, culled to words required
https://nlp.stanford.edu/projects/glove/
 - Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
PCA, keep only 2D, whiten/copula to fill uniform 2D square
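
Rough sketch of step 1, assuming numpy and the plain-text GloVe format (one word per line followed by its space-separated values). All function names are made up for illustration. The "copula" step is read here as a per-axis rank transform, which is only one way to get a uniform square.

```python
import re
import numpy as np

def tokenise(text):
    """Lowercase (the 42B GloVe vectors are uncased), split into words and punctuation."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def load_glove_subset(lines, vocab):
    """Keep only the vectors for words in `vocab`; `lines` is any iterable of GloVe text lines."""
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        if word in vocab:
            vectors[word] = np.array(values, dtype=np.float64)
    return vectors

def pca_2d(matrix):
    """Project the rows onto their first two principal components (via SVD)."""
    centred = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

def to_uniform_square(points):
    """Empirical copula: replace each coordinate by its rank, scaled into (0, 1)."""
    ranks = points.argsort(axis=0).argsort(axis=0)
    return (ranks + 0.5) / len(points)
```

In use, `load_glove_subset(open(path, encoding="utf8"), vocab)` would stream the big file once and keep only the needed words, where `path` is wherever the download was unpacked.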

2.
Histogram of P(word vec | label) - 2D (don't know if this will work)
Bayes' rule to classify the label - expected to perform poorly
Or - classifier, then calibrate distribution? Excuse to put an ensemble in?
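
Rough sketch of the histogram idea, assuming the word vectors have already been mapped into the unit square. The class name, bin count and add-one smoothing are illustrative choices, not decided yet.

```python
import numpy as np

class HistogramBayes:
    """2D histogram estimate of P(point | label), combined with the prior by Bayes' rule."""

    def __init__(self, bins=8, smoothing=1.0):
        self.bins = bins
        self.smoothing = smoothing  # pseudo-count added to every cell

    def _cells(self, points):
        # Map coordinates in [0, 1] to integer cell indices.
        idx = (np.asarray(points, dtype=float) * self.bins).astype(int)
        idx = np.clip(idx, 0, self.bins - 1)
        return idx[:, 0], idx[:, 1]

    def fit(self, points, labels):
        labels = list(labels)
        self.classes = sorted(set(labels))
        counts = np.full((len(self.classes), self.bins, self.bins),
                         self.smoothing, dtype=float)
        rows, cols = self._cells(points)
        for r, c, label in zip(rows, cols, labels):
            counts[self.classes.index(label), r, c] += 1
        # Normalise each label's histogram so its cells sum to 1.
        self.log_likelihood = np.log(counts / counts.sum(axis=(1, 2), keepdims=True))
        prior = np.array([labels.count(c) for c in self.classes], dtype=float)
        self.log_prior = np.log(prior / prior.sum())
        return self

    def predict_proba(self, points):
        rows, cols = self._cells(points)
        log_post = self.log_likelihood[:, rows, cols].T + self.log_prior
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def predict(self, points):
        return [self.classes[i] for i in self.predict_proba(points).argmax(axis=1)]
```

The output of `predict_proba` is per token and ignores context, so it could feed straight into a sequence-smoothing step as emission scores.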

Dataset: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus (version 4, but without stupid features)
Has POS and IOB

3.
Markov chain/dynamic programming to improve labels
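
One possible reading of this step is Viterbi decoding over a first-order Markov chain of labels. A minimal sketch, assuming per-token label probabilities from step 2 and a transition matrix estimated elsewhere (the argument names are made up):

```python
import numpy as np

def viterbi(log_emissions, log_transitions, log_initial):
    """Most likely label sequence under a first-order Markov chain.

    log_emissions:   (n_tokens, n_labels) per-token log scores
    log_transitions: (n_labels, n_labels), indexed [previous, next]
    log_initial:     (n_labels,) log probability of the first label
    """
    n_steps, n_states = log_emissions.shape
    score = log_initial + log_emissions[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        # candidates[prev, next] = best score ending in prev, then moving to next
        candidates = score[:, None] + log_transitions
        back[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + log_emissions[t]
    # Trace the best path backwards from the best final label.
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With sticky transitions, a weakly preferred label in the middle of a run gets overruled by its neighbours, which is the kind of label improvement intended here.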

4.
Information extraction through simple pattern matching
Need a db of patterns
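
Rough sketch of what a tiny pattern db could look like, assuming Penn Treebank style POS tags like those in the dataset. The chunking rules, the two patterns and the function names are all made up for illustration; a real db would need many more patterns.

```python
from itertools import groupby

def coarse(tag):
    """Collapse POS tags into crude chunk types."""
    if tag.startswith("NN") or tag in {"DT", "JJ", "PRP"}:
        return "NP"
    if tag.startswith("VB"):
        return "VP"
    if tag == "IN":
        return "PP"
    return "O"

def chunk(tagged):
    """Merge runs of tokens with the same coarse type into (type, text) chunks."""
    return [(kind, " ".join(word for word, _ in group))
            for kind, group in groupby(tagged, key=lambda pair: coarse(pair[1]))]

# Each pattern: (chunk types, subject index, relation indices, object index).
PATTERNS = [
    (("NP", "VP", "PP", "NP"), 0, (1, 2), 3),
    (("NP", "VP", "NP"), 0, (1,), 2),
]

def extract(tagged):
    """Return (subject, relation, object) triples for the first pattern matching at each position."""
    chunks = chunk(tagged)
    kinds = tuple(kind for kind, _ in chunks)
    triples = []
    for start in range(len(chunks)):
        for shape, subj, rel, obj in PATTERNS:
            if kinds[start:start + len(shape)] == shape:
                text = [words for _, words in chunks[start:start + len(shape)]]
                triples.append((text[subj], " ".join(text[i] for i in rel), text[obj]))
                break
    return triples
```

For example, `extract([("Thousands", "NNS"), ("marched", "VBD"), ("through", "IN"), ("London", "NNP")])` gives the single triple `("Thousands", "marched through", "London")`.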

5.
Train model to calculate probability of valid extraction?
- could be an ensemble?
What's the data?

6.
Could get them to visualise resulting graph? An application?
Question answering system? Needs more thought as it's kinda tricky.
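
Simplest version of the graph/QA idea as a sketch, assuming extractions arrive as (subject, relation, object) triples. Function names are made up, and exact string matching on subject and relation is a big simplification.

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency list: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def answer(graph, subject, relation):
    """Answer 'subject relation ?' by following matching edges."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]
```

The same adjacency list could be handed to a graph drawing tool for the visualisation option.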

